{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Preprocessing for numerical features\n",
    "\n",
    "In this notebook, we still use numerical features only.\n",
    "\n",
    "Here we introduce these new aspects:\n",
    "\n",
    "* an example of preprocessing, namely **scaling numerical variables**;\n",
    "* using a scikit-learn **pipeline** to chain preprocessing and model training.\n",
    "\n",
    "## Data preparation\n",
    "\n",
    "First, let's load the full adult census dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "adult_census = pd.read_csv(\"../datasets/adult-census.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now drop the target from the data we use to train our predictive model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "target_name = \"class\"\n",
    "target = adult_census[target_name]\n",
    "data = adult_census.drop(columns=target_name)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, we select only the numerical columns, as seen in the previous notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "numerical_columns = [\"age\", \"capital-gain\", \"capital-loss\", \"hours-per-week\"]\n",
    "\n",
    "data_numeric = data[numerical_columns]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we can divide our dataset into a train and test sets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "data_train, data_test, target_train, target_test = train_test_split(\n",
    "    data_numeric, target, random_state=42\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Model fitting with preprocessing\n",
    "\n",
    "A range of preprocessing algorithms in scikit-learn allow us to transform the\n",
    "input data before training a model. In our case, we will standardize the data\n",
    "and then train a new logistic regression model on that new version of the\n",
    "dataset.\n",
    "\n",
    "Let's start by printing some statistics about the training data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_train.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that the dataset's features span across different ranges. Some\n",
    "algorithms make some assumptions regarding the feature distributions and\n",
    "normalizing features is usually helpful to address such assumptions.\n",
    "\n",
    "<div class=\"admonition tip alert alert-warning\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Tip</p>\n",
    "<p>Here are some reasons for scaling features:</p>\n",
    "<ul class=\"last simple\">\n",
    "<li>Models that rely on the distance between a pair of samples, for instance\n",
    "k-nearest neighbors, should be trained on normalized features to make each\n",
    "feature contribute approximately equally to the distance computations.</li>\n",
    "<li>Many models such as logistic regression use a numerical solver (based on\n",
    "gradient descent) to find their optimal parameters. This solver converges\n",
    "faster when the features are scaled, as it requires less steps (called\n",
    "<strong>iterations</strong>) to reach the optimal solution.</li>\n",
    "</ul>\n",
    "</div>\n",
    "\n",
    "Whether or not a machine learning model requires scaling the features depends\n",
    "on the model family. Linear models such as logistic regression generally\n",
    "benefit from scaling the features while other models such as decision trees do\n",
    "not need such preprocessing (but would not suffer from it).\n",
    "\n",
    "We show how to apply such normalization using a scikit-learn transformer\n",
    "called `StandardScaler`. This transformer shifts and scales each feature\n",
    "individually so that they all have a 0-mean and a unit standard deviation.\n",
    "We recall that transformers are estimators that have a `transform` method.\n",
    "\n",
    "We now investigate different steps used in scikit-learn to achieve such a\n",
    "transformation of the data.\n",
    "\n",
    "First, one needs to call the method `fit` in order to learn the scaling from\n",
    "the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "scaler = StandardScaler()\n",
    "scaler.fit(data_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `fit` method for transformers is similar to the `fit` method for\n",
    "predictors. The main difference is that the former has a single argument (the\n",
    "data matrix), whereas the latter has two arguments (the data matrix and the\n",
    "target).\n",
    "\n",
    "![Transformer fit diagram](../figures/api_diagram-transformer.fit.svg)\n",
    "\n",
    "In this case, the algorithm needs to compute the mean and standard deviation\n",
    "for each feature and store them into some NumPy arrays. Here, these statistics\n",
    "are the model states.\n",
    "\n",
    "<div class=\"admonition note alert alert-info\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
    "<p class=\"last\">The fact that the model states of this scaler are arrays of means and standard\n",
    "deviations is specific to the <tt class=\"docutils literal\">StandardScaler</tt>. Other scikit-learn\n",
    "transformers may compute different statistics and store them as model states,\n",
    "in a similar fashion.</p>\n",
    "</div>\n",
    "\n",
    "We can inspect the computed means and standard deviations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "scaler.mean_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "scaler.scale_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"admonition note alert alert-info\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
    "<p class=\"last\">scikit-learn convention: if an attribute is learned from the data, its name\n",
    "ends with an underscore (i.e. <tt class=\"docutils literal\">_</tt>), as in <tt class=\"docutils literal\">mean_</tt> and <tt class=\"docutils literal\">scale_</tt> for the\n",
    "<tt class=\"docutils literal\">StandardScaler</tt>.</p>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Scaling the data is applied to each feature individually (i.e. each column in\n",
    "the data matrix). For each feature, we subtract its mean and divide by its\n",
    "standard deviation.\n",
    "\n",
    "Once we have called the `fit` method, we can perform data transformation by\n",
    "calling the method `transform`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_train_scaled = scaler.transform(data_train)\n",
    "data_train_scaled"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's illustrate the internal mechanism of the `transform` method and put it\n",
    "to perspective with what we already saw with predictors.\n",
    "\n",
    "![Transformer transform\n",
    "diagram](../figures/api_diagram-transformer.transform.svg)\n",
    "\n",
    "The `transform` method for transformers is similar to the `predict` method for\n",
    "predictors. It uses a predefined function, called a **transformation\n",
    "function**, and uses the model states and the input data. However, instead of\n",
    "outputting predictions, the job of the `transform` method is to output a\n",
    "transformed version of the input data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, the method `fit_transform` is a shorthand method to call successively\n",
    "`fit` and then `transform`.\n",
    "\n",
    "![Transformer fit_transform diagram](../figures/api_diagram-transformer.fit_transform.svg)\n",
    "\n",
    "In scikit-learn jargon, a **transformer** is defined as an estimator (an\n",
    "object with a `fit` method) supporting `transform` or `fit_transform`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_train_scaled = scaler.fit_transform(data_train)\n",
    "data_train_scaled"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lines_to_next_cell": 0
   },
   "source": [
    "By default, all scikit-learn transformers output NumPy arrays. Since\n",
    "scikit-learn 1.2, it is possible to set the output to be a pandas dataframe,\n",
    "which makes data exploration easier as it preserves the column names. The\n",
    "method `set_output` controls this behaviour. Please refer to this [example\n",
    "from the scikit-learn\n",
    "documentation](https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_set_output.html)\n",
    "for more options to configure the output of transformers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "scaler = StandardScaler().set_output(transform=\"pandas\")\n",
    "data_train_scaled = scaler.fit_transform(data_train)\n",
    "data_train_scaled.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice that the mean of all the columns is close to 0 and the standard\n",
    "deviation in all cases is close to 1. We can also visualize the effect of\n",
    "`StandardScaler` using a jointplot to show both the histograms of the\n",
    "distributions and a scatterplot of any pair of numerical features at the same\n",
    "time. We can observe that `StandardScaler` does not change the structure of\n",
    "the data itself but the axes get shifted and scaled."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "\n",
    "# number of points to visualize to have a clearer plot\n",
    "num_points_to_plot = 300\n",
    "\n",
    "sns.jointplot(\n",
    "    data=data_train[:num_points_to_plot],\n",
    "    x=\"age\",\n",
    "    y=\"hours-per-week\",\n",
    "    marginal_kws=dict(bins=15),\n",
    ")\n",
    "plt.suptitle(\n",
    "    \"Jointplot of 'age' vs 'hours-per-week' \\nbefore StandardScaler\", y=1.1\n",
    ")\n",
    "\n",
    "sns.jointplot(\n",
    "    data=data_train_scaled[:num_points_to_plot],\n",
    "    x=\"age\",\n",
    "    y=\"hours-per-week\",\n",
    "    marginal_kws=dict(bins=15),\n",
    ")\n",
    "_ = plt.suptitle(\n",
    "    \"Jointplot of 'age' vs 'hours-per-week' \\nafter StandardScaler\", y=1.1\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can easily combine sequential operations with a scikit-learn `Pipeline`,\n",
    "which chains together operations and is used as any other classifier or\n",
    "regressor. The helper function `make_pipeline` creates a `Pipeline`: it\n",
    "takes as arguments the successive transformations to perform, followed by the\n",
    "classifier or regressor model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import time\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.pipeline import make_pipeline\n",
    "\n",
    "model = make_pipeline(StandardScaler(), LogisticRegression())\n",
    "model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `make_pipeline` function did not require us to give a name to each step.\n",
    "Indeed, it was automatically assigned based on the name of the classes\n",
    "provided; a `StandardScaler` step is named `\"standardscaler\"` in the resulting\n",
    "pipeline. We can check the name of each steps of our model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model.named_steps"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This predictive pipeline exposes the same methods as the final predictor:\n",
    "`fit` and `predict` (and additionally `predict_proba`, `decision_function`, or\n",
    "`score`)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "start = time.time()\n",
    "model.fit(data_train, target_train)\n",
    "elapsed_time = time.time() - start"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can represent the internal mechanism of a pipeline when calling `fit` by\n",
    "the following diagram:\n",
    "\n",
    "![pipeline fit diagram](../figures/api_diagram-pipeline.fit.svg)\n",
    "\n",
    "When calling `model.fit`, the method `fit_transform` from each underlying\n",
    "transformer (here a single transformer) in the pipeline is called to:\n",
    "\n",
    "- learn their internal model states\n",
    "- transform the training data. Finally, the preprocessed data are provided to\n",
    "  train the predictor.\n",
    "\n",
    "To predict the targets given a test set, one uses the `predict` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "predicted_target = model.predict(data_test)\n",
    "predicted_target[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's show the underlying mechanism:\n",
    "\n",
    "![pipeline predict diagram](../figures/api_diagram-pipeline.predict.svg)\n",
    "\n",
    "The method `transform` of each transformer (here a single transformer) is\n",
    "called to preprocess the data. Note that there is no need to call the `fit`\n",
    "method for these transformers because we are using the internal model states\n",
    "computed when calling `model.fit`. The preprocessed data is then provided to\n",
    "the predictor that outputs the predicted target by calling its method\n",
    "`predict`.\n",
    "\n",
    "As a shorthand, we can check the score of the full predictive pipeline calling\n",
    "the method `model.score`. Thus, let's check the computational and\n",
    "generalization performance of such a predictive pipeline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model_name = model.__class__.__name__\n",
    "score = model.score(data_test, target_test)\n",
    "print(\n",
    "    f\"The accuracy using a {model_name} is {score:.3f} \"\n",
    "    f\"with a fitting time of {elapsed_time:.3f} seconds \"\n",
    "    f\"in {model[-1].n_iter_[0]} iterations\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We could compare this predictive model with the predictive model used in the\n",
    "previous notebook which did not scale features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = LogisticRegression()\n",
    "start = time.time()\n",
    "model.fit(data_train, target_train)\n",
    "elapsed_time = time.time() - start"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model_name = model.__class__.__name__\n",
    "score = model.score(data_test, target_test)\n",
    "print(\n",
    "    f\"The accuracy using a {model_name} is {score:.3f} \"\n",
    "    f\"with a fitting time of {elapsed_time:.3f} seconds \"\n",
    "    f\"in {model.n_iter_[0]} iterations\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that scaling the data before training the logistic regression was\n",
    "beneficial in terms of computational performance. Indeed, the number of\n",
    "iterations decreased as well as the training time. The generalization\n",
    "performance did not change since both models converged.\n",
    "\n",
    "<div class=\"admonition warning alert alert-danger\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Warning</p>\n",
    "<p class=\"last\">Working with non-scaled data will potentially force the algorithm to iterate\n",
    "more as we showed in the example above. There is also the catastrophic\n",
    "scenario where the number of required iterations is larger than the maximum\n",
    "number of iterations allowed by the predictor (controlled by the <tt class=\"docutils literal\">max_iter</tt>)\n",
    "parameter. Therefore, before increasing <tt class=\"docutils literal\">max_iter</tt>, make sure that the data\n",
    "are well scaled.</p>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this notebook we:\n",
    "\n",
    "* saw the importance of **scaling numerical variables**;\n",
    "* used a **pipeline** to chain scaling and logistic regression training."
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}